ends of the reads, and to remove adaptors and duplicates. Refer to Chapter 1 for detailed
information about this step. For multiplexed data, you need to perform demultiplexing
before you do the quality control. Multiplexing and demultiplexing are discussed in
Chapter 7. The FASTQ files that we downloaded have already been processed and
contain reads of good quality. You can check their quality with FastQC as follows:
# Run FastQC on every FASTQ file; by default the reports are written next to the inputs
fastqc fastqdir/*.fastq
# Open the HTML reports, one per browser tab
firefox fastqdir/*.html
The above commands will display the quality control reports for the six FASTQ files in the
Firefox browser. Click through the six tabs to study the reports.
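Before opening the browser, it can be useful to confirm that FastQC produced one report per
input file. The following is a minimal sketch, assuming FastQC's default naming, in which
sample.fastq yields sample_fastqc.html in the same directory. The function name
check_fastqc_reports is hypothetical and is not part of FastQC.

```shell
# List any FASTQ file in a directory that lacks a FastQC HTML report.
# Assumes FastQC's default output naming: sample.fastq -> sample_fastqc.html
check_fastqc_reports() {
    dir=$1
    missing=0
    for fq in "$dir"/*.fastq; do
        [ -e "$fq" ] || continue            # skip if the glob matched nothing
        report="${fq%.fastq}_fastqc.html"
        if [ ! -f "$report" ]; then
            echo "missing report for $fq"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when every FASTQ file has a report
    [ "$missing" -eq 0 ]
}

# Run the check only when the directory used in the text exists
if [ -d fastqdir ]; then
    check_fastqc_reports fastqdir && echo "all reports present"
fi
```

If a report is reported as missing, rerun FastQC on that file before moving on.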
8.2.3 Removing Host DNA Reads
Metagenomic data recovered from clinical samples are usually mixed with host genomic
DNA sequences. These sequences, which represent an untargeted fraction of the data, must be
filtered out before the subsequent steps of the analysis. Any other untargeted sequences can
also be removed once the host sequences have been removed. The process of removing
the host sequences begins by aligning the raw reads to the reference genome of the host. The
host sequences will map to the reference genome, whereas the metagenomic reads will not.
Thus, after the mapping process, we can extract the unmapped reads and store
them in separate FASTQ files. For paired-end reads, we will have two FASTQ files representing
the raw metagenomic data without the host sequences.
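To make the idea of extracting unmapped reads concrete, the sketch below filters a tiny
hand-written table laid out like the first columns of a SAM file, rather than real alignments.
It relies on the SAM convention that bit 4 of the FLAG field marks an unmapped read and bit 8
marks an unmapped mate, and it keeps only the pairs in which both bits are set. The read names
and flag values are invented for illustration. On real data the same selection is usually made
with the flag filters of samtools view rather than with awk.

```shell
# Toy alignment table: read name, SAM FLAG, reference name (tab-separated).
# Read names and flags are made up for illustration.
toy=$(mktemp)
printf 'host_read/1\t99\tchr1\n'   >  "$toy"
printf 'host_read/2\t147\tchr1\n'  >> "$toy"
printf 'microbe_read/1\t77\t*\n'   >> "$toy"
printf 'microbe_read/2\t141\t*\n'  >> "$toy"

# Keep records whose FLAG has both bit 4 (read unmapped) and
# bit 8 (mate unmapped) set, i.e. pairs that did not map to the host genome.
unmapped=$(awk -F'\t' 'int($2 / 4) % 2 == 1 && int($2 / 8) % 2 == 1 { print $1 }' "$toy")
echo "$unmapped"
rm "$toy"
```

Only the two reads of the unmapped pair are printed; the pair aligned to chr1 is discarded as
host sequence.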
Since the host of our data is human, we will align the reads to the human reference genome.
We have already discussed read mapping in Chapter 2 and in other chapters. This
time we will use the Bowtie2 aligner, so we can walk through the steps without repeating
that discussion. The following are the steps to remove the human host sequences from the
metagenomic data.
8.2.3.1 Download Human Reference Genome
You performed this step in Chapter 6. If you still have the human reference genome and its
Bowtie2 index saved on your drive, you can reuse them, since building the Bowtie2
index may take some time. If you do not have those files stored on your computer, run the
following commands in your project working directory to download the FASTA sequence of
the human reference genome, decompress it, and index it with both Samtools and Bowtie2:
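mkdir -p ref && cd ref
# Download the hg19 FASTA sequence from UCSC and decompress it
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
gunzip hg19.fa.gz
# Index the reference with Samtools and with Bowtie2
samtools faidx hg19.fa
bowtie2-build hg19.fa hg19
cd ..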
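Because building the index is slow, it can help to check for existing files before running
the commands above. The sketch below assumes the default output names: samtools faidx writes
hg19.fa.fai, and bowtie2-build writes six files that start with the prefix given on the command
line and normally end in .bt2 (or .bt2l when a large index is built). The function name
have_reference is hypothetical.

```shell
# Succeed only if the FASTA file, its Samtools index, and a Bowtie2 index
# with the given prefix are all present in the directory.
have_reference() {
    dir=$1; prefix=$2
    [ -f "$dir/$prefix.fa" ] && [ -f "$dir/$prefix.fa.fai" ] || return 1
    for part in 1 2 3 4 rev.1 rev.2; do
        # accept either the small (.bt2) or the large (.bt2l) index format
        [ -f "$dir/$prefix.$part.bt2" ] || [ -f "$dir/$prefix.$part.bt2l" ] || return 1
    done
    return 0
}

if have_reference ref hg19; then
    echo "reference and indexes found; skipping download"
else
    echo "reference or indexes missing; run the download and indexing commands"
fi
```

The check looks only at file names, so an index left incomplete by an interrupted build can
still pass it. If in doubt, rebuild the index.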